Cleaning the weather dataset

In this notebook, we’ll be using numpy and pandas.

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.

Numpy is the fundamental package for scientific computing with Python.

Let’s install the packages pandas and numpy.

## Uncomment and run this cell to install pandas and numpy
#!pip install pandas numpy

# import the libraries
import pandas as pd
import numpy as np
from dataidea.datasets import loadDataset

Let’s check the versions of python, numpy and pandas we’ll be using for this notebook

# checking python version
print('Python Version: ',)
!python --version

Python Version: 
Python 3.10.12

# Checking numpy and pandas versions
print('Pandas Version: ', pd.__version__)
print('Numpy Version: ', np.__version__)

Pandas Version:  2.2.1
Numpy Version:  1.26.4

Let’s load the dataset. We’ll be using a weather dataset that imagined for learning purposes.

# load the dataset
weather_data = loadDataset('weather')

We can sample out random rows from the dataset using the sample() method, we can use the n parameter to specify the number of rows to sample

# sample out random values from the dataset
weather_data.sample(n=5)

	day	temperature	windspead	event
2	05/01/2017	28.0	NaN	Snow
8	11/01/2017	40.0	12.0	Sunny
7	10/01/2017	34.0	8.0	Cloudy
1	04/01/2017	NaN	9.0	Sunny
4	07/01/2017	32.0	NaN	Rain

Display some info about the dataset eg number of entries, count of non-null values and variable datatypes using the info() method

# get quick dataframe info
weather_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   day          9 non-null      object 
 1   temperature  5 non-null      float64
 2   windspead    5 non-null      float64
 3   event        7 non-null      object 
dtypes: float64(2), object(2)
memory usage: 416.0+ bytes

We can count all missing values in each column in our dataframe by using dataframe.isna().sum(), eg

# count missing values in each column
weather_data.isna().sum()

day            0
temperature    4
windspead      4
event          2
dtype: int64

We can use a boolean-indexing like technique to find all rows in a dataset with missing values in a specific column.

# get rows with missing data in temperature
weather_data[weather_data.temperature.isna()]

	day	temperature	windspead	event
1	04/01/2017	NaN	9.0	Sunny
3	06/01/2017	NaN	7.0	NaN
5	08/01/2017	NaN	NaN	Sunny
6	09/01/2017	NaN	NaN	NaN

# get rows with missing data in event column
weather_data[weather_data.event.isna()]

	day	temperature	windspead	event
3	06/01/2017	NaN	7.0	NaN
6	09/01/2017	NaN	NaN	NaN

For the next part, we would like to demonstrate forward fill (ffill()) and backward fill (bfill), we first create two copies of the dataframe to avoid modifying our original copy in memory. - ffill() fills the missing values with the previous valid value in the column - bfill() fills the missing values with the next valid value in the column

# Create copies of a dataframe
weather_data1 = weather_data.copy()
weather_data2 = weather_data.copy()

# fill with the previous valid value
weather_data1['event'] = weather_data1.event.ffill()
weather_data1

	day	temperature	windspead	event
0	01/01/2017	32.0	6.0	Rain
1	04/01/2017	NaN	9.0	Sunny
2	05/01/2017	28.0	NaN	Snow
3	06/01/2017	NaN	7.0	Snow
4	07/01/2017	32.0	NaN	Rain
5	08/01/2017	NaN	NaN	Sunny
6	09/01/2017	NaN	NaN	Sunny
7	10/01/2017	34.0	8.0	Cloudy
8	11/01/2017	40.0	12.0	Sunny

weather_data

	day	temperature	windspead	event
0	01/01/2017	32.0	6.0	Rain
1	04/01/2017	NaN	9.0	Sunny
2	05/01/2017	28.0	NaN	Snow
3	06/01/2017	NaN	7.0	NaN
4	07/01/2017	32.0	NaN	Rain
5	08/01/2017	NaN	NaN	Sunny
6	09/01/2017	NaN	NaN	NaN
7	10/01/2017	34.0	8.0	Cloudy
8	11/01/2017	40.0	12.0	Sunny

# fill with the next valid value in the column
weather_data2['event'] = weather_data2.event.bfill()
weather_data2

	day	temperature	windspead	event
0	01/01/2017	32.0	6.0	Rain
1	04/01/2017	NaN	9.0	Sunny
2	05/01/2017	28.0	NaN	Snow
3	06/01/2017	NaN	7.0	Rain
4	07/01/2017	32.0	NaN	Rain
5	08/01/2017	NaN	NaN	Sunny
6	09/01/2017	NaN	NaN	Cloudy
7	10/01/2017	34.0	8.0	Cloudy
8	11/01/2017	40.0	12.0	Sunny

We can modify (or fill) a specific value in the dataframe by using the loc[] method. This picks the value by its row (index) and column names. Assigning it a new value modifies it in the dataframe as illustrated below

# modify a specific value in the dataframe
weather_data2.loc[1, 'temperature'] = 29
weather_data2

	day	temperature	windspead	event
0	01/01/2017	32.0	6.0	Rain
1	04/01/2017	29.0	9.0	Sunny
2	05/01/2017	28.0	NaN	Snow
3	06/01/2017	NaN	7.0	Rain
4	07/01/2017	32.0	NaN	Rain
5	08/01/2017	NaN	NaN	Sunny
6	09/01/2017	NaN	NaN	Cloudy
7	10/01/2017	34.0	8.0	Cloudy
8	11/01/2017	40.0	12.0	Sunny

We can use the fillna() method to replace all missing values in a column with a specific value as demostrated value

# replace missing values in temperature column with mean
weather_data2['temperature'] = weather_data2.temperature.fillna(
    value=weather_data2.temperature.mean()
)
weather_data2

	day	temperature	windspead	event
0	01/01/2017	32.0	6.0	Rain
1	04/01/2017	29.0	9.0	Sunny
2	05/01/2017	28.0	NaN	Snow
3	06/01/2017	32.5	7.0	Rain
4	07/01/2017	32.0	NaN	Rain
5	08/01/2017	32.5	NaN	Sunny
6	09/01/2017	32.5	NaN	Cloudy
7	10/01/2017	34.0	8.0	Cloudy
8	11/01/2017	40.0	12.0	Sunny

# create a copy of weather_data2
weather_data22 = weather_data2.copy()

# Replace missing values in windspead column with a specific value
weather_data22['windspead'] = weather_data2.windspead.fillna(value=7.5)
weather_data22

	day	temperature	windspead	event
0	01/01/2017	32.0	6.0	Rain
1	04/01/2017	29.0	9.0	Sunny
2	05/01/2017	28.0	7.5	Snow
3	06/01/2017	32.5	7.0	Rain
4	07/01/2017	32.0	7.5	Rain
5	08/01/2017	32.5	7.5	Sunny
6	09/01/2017	32.5	7.5	Cloudy
7	10/01/2017	34.0	8.0	Cloudy
8	11/01/2017	40.0	12.0	Sunny

We can also use the fillna() method to fill missing values in multiple columns by passing in the dictionary of key/value pairs of column-name and value to replace. Before we demonstrate this, let’s create a copy of the dataframe to avoid modifying the original in memory

# create a copy of the weather_data dataframe
weather_data3 = weather_data.copy()

# Replace missing values in temperature, column and event
weather_data3.fillna(value={
    'temperature': weather_data3.temperature.mean(), 
    'windspead': weather_data3.windspead.max(), 
    'event': weather_data3.event.bfill()
    }, inplace=True)

weather_data3

	day	temperature	windspead	event
0	01/01/2017	32.0	6.0	Rain
1	04/01/2017	33.2	9.0	Sunny
2	05/01/2017	28.0	12.0	Snow
3	06/01/2017	33.2	7.0	Rain
4	07/01/2017	32.0	12.0	Rain
5	08/01/2017	33.2	12.0	Sunny
6	09/01/2017	33.2	12.0	Cloudy
7	10/01/2017	34.0	8.0	Cloudy
8	11/01/2017	40.0	12.0	Sunny

We can optionally drop all rows with missing values using the dropna() method. Before we demonstrate that, let’s first create a dataframe copy to avoid modifying the original in the memory

# create a copy of weather_data dataframe
weather_data4 = weather_data.copy()

# Drop all rows with missing values
weather_data4.dropna()

	day	temperature	windspead	event
0	01/01/2017	32.0	6.0	Rain
7	10/01/2017	34.0	8.0	Cloudy
8	11/01/2017	40.0	12.0	Sunny